Conversation
Fixes #51

Add support for JavaScript code analysis using tree-sitter.

* Add `api/analyzers/javascript/analyzer.py` implementing a `JavaScriptAnalyzer` class using tree-sitter for JavaScript.
  - Implement methods for first- and second-pass analysis.
  - Use tree-sitter to parse JavaScript code.
  - Extract functions and classes from JavaScript code.
  - Connect entities in the graph.
* Update `api/analyzers/source_analyzer.py` to include `JavaScriptAnalyzer` in the analyzers list.
* Add the `tree-sitter-javascript` dependency to `pyproject.toml`.
* Add utility functions for JavaScript analysis in `api/analyzers/utils.py`.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/FalkorDB/code-graph-backend/issues/51?shareId=XXXX-XXXX-XXXX-XXXX).
No actionable comments were generated in the recent review. 🎉

📝 Walkthrough

Adds JavaScript support: a new `JavaScriptAnalyzer` built on tree-sitter, wired into `SourceAnalyzer`, plus JavaScript utility helpers and the `tree-sitter-javascript` dependency.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant FS as FileSystem (*.js)
    participant Analyzer as JavaScriptAnalyzer
    participant LSP as JS Language Server
    participant Graph as CodeGraph
    rect rgba(100,149,237,0.5)
        FS->>Analyzer: provide file contents (first_pass)
        Analyzer->>Analyzer: parse with tree-sitter, create Entities
        Analyzer->>Graph: add Entities (declarations)
    end
    rect rgba(60,179,113,0.5)
        Analyzer->>LSP: request symbol/type info (second_pass)
        LSP-->>Analyzer: return resolution data
        Analyzer->>Graph: add edges (calls, inherits, references)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (6)
api/analyzers/javascript/analyzer.py (3)
22-71: Enhance function declaration processing and handle arrow functions.

Currently, `process_function_declaration` only captures standard function declarations. If you'd like to capture arrow functions or function expressions, consider expanding the query or logic. Also, note that using `find_child_of_type(node, 'identifier')` might skip function declarations without a named identifier (like anonymous functions).

🧰 Tools
🪛 Ruff (0.8.2)

22-22: `Function` may be undefined, or defined from star imports (F405)

35-35: `find_child_of_type` may be undefined, or defined from star imports (F405)

45-45: `find_child_of_type` may be undefined, or defined from star imports (F405)

62-62: `Function` may be undefined, or defined from star imports (F405)
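The broadened capture set the comment suggests could look roughly like this. This is a sketch, not the repository's actual query; the node type names (`function_expression`, `arrow_function`, `generator_function_declaration`) follow the tree-sitter-javascript grammar and should be checked against the grammar version actually in use:

```scheme
; Hypothetical wider capture set for the first pass:
; named declarations plus the anonymous/expression forms.
[
  (function_declaration) @function
  (function_expression) @function
  (arrow_function) @function
  (generator_function_declaration) @function
]
```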
72-102: Extend class declaration handling for inheritance.

This method correctly extracts the class name from the `identifier` child. You may want to handle `extends` clauses (e.g., `class Foo extends Bar`) or keep track of implemented interfaces in the future.

🧰 Tools
🪛 Ruff (0.8.2)

72-72: `Class` may be undefined, or defined from star imports (F405)

85-85: `find_child_of_type` may be undefined, or defined from star imports (F405)

99-99: `Class` may be undefined, or defined from star imports (F405)
157-220: Protect against missing function entities and arrow function calls.

- The second pass currently assumes function declarations are always standard. Arrow functions won't be captured, so the calls might remain unresolved.
- `assert(caller_f is not None)` may crash if the function is somehow not recognized. Consider a safer check, logging a warning, or creating a placeholder entity to avoid halting the entire analysis.

```diff
-assert(caller_f is not None)
+if caller_f is None:
+    logger.warning(f"Caller function '{caller_name}' not found. Skipping relationship.")
+    continue
```

🧰 Tools
🪛 Ruff (0.8.2)

216-216: `Function` may be undefined, or defined from star imports (F405)
api/analyzers/utils.py (2)
25-38: Augment arrow function or unnamed function handling.

`extract_js_function_name` assumes there's an identifier child. Consider fallback logic for arrow or anonymous functions if needed (e.g., generating a placeholder name).
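One possible fallback, shown against a minimal stand-in for tree-sitter's `Node` interface (`type`, `text`, `children`, `parent`); the real analyzer would operate on actual tree-sitter nodes, and the placeholder name is illustrative:

```python
# Hypothetical fallback naming for anonymous and arrow functions.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Minimal stand-in mimicking tree-sitter's Node interface."""
    type: str
    text: bytes = b""
    children: list = field(default_factory=list)
    parent: Optional["Node"] = None


def js_function_name(node: Node) -> str:
    """Best-effort name: identifier child, else enclosing declarator, else placeholder."""
    for child in node.children:
        if child.type == "identifier":
            return child.text.decode("utf-8")
    # Anonymous/arrow functions: borrow the name from an enclosing
    # variable_declarator, e.g. `const add = (a, b) => a + b;`
    if node.parent is not None and node.parent.type == "variable_declarator":
        for child in node.parent.children:
            if child.type == "identifier":
                return child.text.decode("utf-8")
    return "<anonymous>"
```

The placeholder keeps anonymous functions visible in the graph instead of silently dropping them.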
40-53: Handle anonymous or default-export classes.

Similar to functions, classes can sometimes be declared without a direct identifier (`export default class`, etc.). Consider a fallback name or a distinct approach for these cases.
api/analyzers/source_analyzer.py (1)
20-21: Consider broader JavaScript-related extensions.

Mapping `.js` to `JavaScriptAnalyzer` is a good start. If you plan to handle `.mjs`, `.cjs`, or `.jsx`, consider extending the analyzers dictionary or making it more flexible.
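A sketch of the more flexible mapping; the analyzer class here is a stand-in, not the repository's real `JavaScriptAnalyzer`:

```python
# Hypothetical sketch: route every JavaScript-family extension the review
# mentions (.mjs, .cjs, .jsx) to a single analyzer.
from pathlib import Path


class JavaScriptAnalyzer:
    """Stand-in for the real analyzer class."""


_JS_EXTENSIONS = (".js", ".mjs", ".cjs", ".jsx")

# One analyzer class serves all JavaScript-family extensions.
analyzers = {ext: JavaScriptAnalyzer for ext in _JS_EXTENSIONS}


def analyzer_for(path: str):
    """Return the analyzer class registered for the file's extension, or None."""
    return analyzers.get(Path(path).suffix.lower())
```

Lower-casing the suffix also covers files with upper-case extensions on case-insensitive filesystems.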
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `api/analyzers/javascript/analyzer.py` (1 hunks)
- `api/analyzers/source_analyzer.py` (2 hunks)
- `api/analyzers/utils.py` (1 hunks)
- `pyproject.toml` (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
api/analyzers/javascript/analyzer.py
3-3: from ..utils import * used; unable to detect undefined names
(F403)
5-5: from ...entities import * used; unable to detect undefined names
(F403)
22-22: Function may be undefined, or defined from star imports
(F405)
35-35: find_child_of_type may be undefined, or defined from star imports
(F405)
45-45: find_child_of_type may be undefined, or defined from star imports
(F405)
62-62: Function may be undefined, or defined from star imports
(F405)
72-72: Class may be undefined, or defined from star imports
(F405)
85-85: find_child_of_type may be undefined, or defined from star imports
(F405)
99-99: Class may be undefined, or defined from star imports
(F405)
123-123: File may be undefined, or defined from star imports
(F405)
216-216: Function may be undefined, or defined from star imports
(F405)
🔇 Additional comments (5)
api/analyzers/javascript/analyzer.py (3)
18-21: Analyzer constructor looks good.

The constructor properly initializes the Parser with `JS_LANGUAGE`. This is a succinct and clear approach.
103-156: Consider robust UTF-8 handling.

The code tries to decode the file content using UTF-8, which might fail if the file is in another encoding. This fallback is fine in many cases, but you might want to log or handle suspected encoding issues more explicitly if supporting diverse codebases.
🧰 Tools
🪛 Ruff (0.8.2)
123-123: `File` may be undefined, or defined from star imports (F405)
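A defensive decode along the lines the comment suggests might look like this; a sketch, not the repository's actual code:

```python
# Hypothetical sketch: try UTF-8 first, log the problem, then fall back to
# replacement characters instead of aborting the whole analysis on one
# mis-encoded file.
import logging

logger = logging.getLogger(__name__)


def decode_source(raw: bytes, path: str) -> str:
    """Decode file bytes, degrading gracefully on non-UTF-8 content."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        logger.warning(
            "Non-UTF-8 content in %s (%s); using replacement characters", path, exc
        )
        return raw.decode("utf-8", errors="replace")
```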
10-13: Confirm Tree-sitter JavaScript library availability.

The import statements `import tree_sitter_javascript as tsjs` and `Language, Parser` from `tree_sitter` look correct for parsing JavaScript. Just verify that the `pyproject.toml` changes (adding `tree-sitter-javascript`) are indeed installed in your environment.

✅ Verification successful

tree-sitter-javascript dependency is properly configured

The `pyproject.toml` file shows that `tree-sitter-javascript` version `^0.23.2` is correctly listed as a dependency in the project's Poetry configuration. This confirms that the import statement `import tree_sitter_javascript as tsjs` will work as expected when the dependencies are installed.

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Ensure tree-sitter-javascript is installed in the current environment
pip freeze | grep "tree-sitter-javascript"
```
Length of output: 136

Script:
```shell
#!/bin/bash
# Check if pyproject.toml exists and contains tree-sitter-javascript dependency
fd pyproject.toml --exec cat {}
```
Length of output: 877

Script:
```shell
#!/bin/bash
# Check if requirements.txt exists and contains tree-sitter-javascript dependency
fd requirements.txt --exec cat {}
```
Length of output: 65574

Script:
```shell
#!/bin/bash
# Check if setup.py exists and contains tree-sitter-javascript dependency
fd setup.py --exec cat {}
```
Length of output: 29

Script:
```shell
#!/bin/bash
# Look for any other Python dependency files
fd ".*requirements.*\.txt|poetry\.lock|Pipfile" --exec cat {}
```
Length of output: 65602
api/analyzers/source_analyzer.py (1)
11-11: JavaScript analyzer import addition confirmed.

Adding `JavaScriptAnalyzer` aligns with the pull request objective to support `.js` files.

pyproject.toml (1)
16-16: Dependency addition looks appropriate.

`tree-sitter-javascript = "^0.23.2"` is correctly specified. Validate that this version meets any advanced language feature requirements you may want to parse (e.g., newer JS syntax).

✅ Verification successful
Version 0.23.2 is appropriate for the JavaScript parsing requirements
Based on the code analysis, the tree-sitter-javascript dependency is used for basic JavaScript AST parsing, focusing on:
- Function declarations and their parameters
- Class declarations
- Function call expressions
- Basic identifier extraction
The specified version (^0.23.2) is actually newer than the latest PyPI version (0.23.1) and supports all the JavaScript syntax features currently being used in the analyzer. The codebase doesn't use any advanced JavaScript features that would require a newer version, as it's primarily focused on fundamental syntax elements like functions, classes, and method calls.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Check if the newly added dependency is at the latest stable version
curl -s https://pypi.org/pypi/tree-sitter-javascript/json | jq '.info.version'
```
Length of output: 87

Script:
```shell
#!/bin/bash
# Check if there are any JavaScript files in the repository to understand the JS features being used
fd -e js -x head -n 50
```
Length of output: 24

Script:
```shell
#!/bin/bash
# Look for any configuration files that might specify JavaScript version/features
fd -g '*babel*' -g '*eslint*' -g '*tsconfig*' -g '*jsconfig*'
```
Length of output: 273

Script:
```shell
#!/bin/bash
# Check if tree-sitter-javascript is already in use and how
rg -l "tree-sitter-javascript"
```
Length of output: 57

Script:
```shell
#!/bin/bash
# Check the content of poetry.lock to see how tree-sitter-javascript is being used
cat poetry.lock | grep -A 10 "tree-sitter-javascript"
```
Length of output: 678

Script:
```shell
#!/bin/bash
# Look for Python files that might be using tree-sitter-javascript
rg -l "from tree_sitter"
```
Length of output: 147

Script:
```shell
#!/bin/bash
# Check how tree-sitter-javascript is being used in the JavaScript analyzer
cat api/analyzers/javascript/analyzer.py
```
Length of output: 7853

Script:
```shell
#!/bin/bash
# Check utils.py to understand the common tree-sitter usage patterns
cat api/analyzers/utils.py
```
Length of output: 1429
`api/analyzers/utils.py` (Outdated)
```python
def extract_js_function_name(node: Node) -> str:
    """
    Extract the function name from a JavaScript function node.

    Args:
        node (Node): The AST node representing the function.

    Returns:
        str: The name of the function.
    """
    for child in node.children:
        if child.type == 'identifier':
            return child.text.decode('utf-8')
    return ''


def extract_js_class_name(node: Node) -> str:
    """
    Extract the class name from a JavaScript class node.

    Args:
        node (Node): The AST node representing the class.

    Returns:
        str: The name of the class.
    """
    for child in node.children:
        if child.type == 'identifier':
            return child.text.decode('utf-8')
    return ''
```
Language-specific utilities shouldn't be added to `utils`.
`api/analyzers/javascript/analyzer.py` (Outdated)
```python
"""
Perform the first pass processing of a JavaScript source file.

Args:
    path (Path): The path to the JavaScript source file.
    f (io.TextIOWrapper): The file object representing the opened JavaScript source file.
    graph (Graph): The Graph object where entities will be added.

Returns:
    None
"""
```
The comment should include information about what the function does; "processing JavaScript file" is too general. Specify which entities are extracted.
`api/analyzers/javascript/analyzer.py` (Outdated)
```python
try:
    # Parse file
    content = f.read()
    tree = self.parser.parse(content)
except Exception as e:
    logger.error(f"Failed to process file {path}: {e}")
    return
```
I think this is a bit of a waste, we've already read the file and parsed it on the first pass.
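One way to address this, sketched with a stand-in `parse_fn` rather than the real tree-sitter `Parser` (names are hypothetical): cache the tree built in the first pass and reuse it in the second.

```python
# Hypothetical sketch: parse each file once, reuse the tree across passes.

class TreeCache:
    def __init__(self, parse_fn):
        self._parse = parse_fn   # stands in for tree_sitter.Parser.parse
        self._trees = {}

    def tree_for(self, path, content):
        """Return the parse tree for `path`, parsing only on first request."""
        if path not in self._trees:
            self._trees[path] = self._parse(content)
        return self._trees[path]
```

The second pass then asks the cache for the tree instead of re-reading and re-parsing the file.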
`api/analyzers/javascript/analyzer.py` (Outdated)
```python
caller = function_def[0]
caller_name = caller.text.decode('utf-8')
caller_f = graph.get_function_by_name(caller_name)
assert(caller_f is not None)
```
Using assert in production code is unconventional. Consider handling this case more gracefully.

```diff
-assert(caller_f is not None)
+if caller_f is None:
+    logger.error(f'Caller function not found: {caller_name}')
+    continue
```
Resolve merge conflicts with the main branch, which restructured the analyzer pattern. Rewrite `JavaScriptAnalyzer` to follow the new `AbstractAnalyzer` interface with `get_entity_label`, `get_entity_name`, `get_entity_docstring`, `get_entity_types`, `add_symbols`, `resolve_symbol`, and other required abstract methods.

Address PR review comments:

- Replace wildcard imports with explicit imports (CodeRabbit)
- Move JS-specific utils out of shared `utils.py` (swilly22)
- Remove redundant file re-parsing in second pass (swilly22)
- Replace assert with proper error handling pattern (copilot-reviewer)
- Improve docstrings to be more specific (swilly22)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```diff
 def analyze_sources(self, path: Path, ignore: list[str], graph: Graph) -> None:
     path = path.resolve()
-    files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs"))
+    files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) + list(path.rglob("*.js"))
```
Check failure: Code scanning / CodeQL · Uncontrolled data used in path expression (High)

Copilot Autofix (AI, 9 days ago)
General approach: ensure that any user-supplied path is validated and constrained before it is used with filesystem APIs (`Path.resolve`, `Path.rglob`, `Repository(path)`). A common pattern is to define a safe root directory (from configuration or an environment variable), resolve both the root and the requested path, and then verify that the requested path is inside the root (using `.resolve()` and a prefix / ancestor check). If the check fails, reject the request.

Best fix with minimal behavior change:

- In `api/analyzers/source_analyzer.py`, add a small helper method on `SourceAnalyzer` to validate and normalize incoming paths:
  - Accept a `str` or `Path`.
  - Resolve it to an absolute path.
  - Optionally enforce that it is under a configured root directory, if such an environment variable exists (we'll look it up inside the helper to avoid new imports here).
  - Ensure it is a directory for `analyze_local_folder`, and exists for `analyze_local_repository`.
  - Raise a clear `ValueError` (or `RuntimeError`) on violation.
- Use this helper in:
  - `analyze_local_folder`: instead of passing `Path(path)` directly, call the validator, then use the returned `Path` object for `analyze_sources`.
  - `analyze_local_repository`: use the same validator to get a normalized, allowed repo path, then pass that to both `analyze_local_folder` and `Repository(...)`.
- The endpoint in `tests/index.py` already checks `os.path.isdir(path)`, but that's only used for tests. With the new validation in `SourceAnalyzer`, any other caller (such as `api/index.py` routes that eventually call `analyze_local_folder`/`analyze_local_repository`) also gets the protection.

We can implement the helper purely inside `SourceAnalyzer` using `Path.resolve` and `Path.is_relative_to` (Python 3.9+) or a `try: relative_to` fallback. No new third-party dependencies are needed; we'll only add an `import os` in `api/analyzers/source_analyzer.py` if we choose to read an environment variable for the allowed root.

Concretely:

- Add a private method `_normalize_and_validate_path(self, path_str: str, must_be_dir: bool = True) -> Path` above `analyze_local_folder`.
- In `analyze_local_folder`, call this helper and pass the returned `Path` to `analyze_sources` instead of constructing `Path(path)` directly.
- In `analyze_local_repository`, call the same helper with `must_be_dir=True`, then use the resolved `Path` both for `analyze_local_folder` and `Repository(str(resolved_path))`.

This keeps existing functionality (scanning arbitrary directories) but ensures paths are absolute, normalized, and (optionally) within a configured safe root; if a root is not configured, we still normalize and ensure the path is a directory before traversing.
```diff
@@ -184,6 +184,24 @@
         # Second pass analysis of the source code
         self.second_pass(graph, files, path)

+    def _normalize_and_validate_path(self, path: str, must_be_dir: bool = True) -> Path:
+        """
+        Normalize and validate a user-supplied path before accessing the filesystem.
+
+        The path is resolved to an absolute path to eliminate any '..' segments.
+        If 'must_be_dir' is True, the path must exist and be a directory.
+
+        Raises:
+            ValueError: If the path is not valid for analysis.
+        """
+        base_path = Path(path)
+        resolved = base_path.resolve()
+
+        if must_be_dir and not resolved.is_dir():
+            raise ValueError(f"Path '{resolved}' must be an existing directory")
+
+        return resolved
+
     def analyze_local_folder(self, path: str, g: Graph, ignore: Optional[list[str]] = []) -> None:
         """
         Analyze path.
@@ -195,8 +213,11 @@
         logging.info(f"Analyzing local folder {path}")

+        # Normalize and validate the supplied path before analysis
+        resolved_path = self._normalize_and_validate_path(path, must_be_dir=True)
+
         # Analyze source files
-        self.analyze_sources(Path(path), ignore, g)
+        self.analyze_sources(resolved_path, ignore, g)

         logging.info("Done analyzing path")

@@ -213,10 +233,13 @@
         """
         from pygit2.repository import Repository

-        self.analyze_local_folder(path, ignore)
+        # Normalize and validate the repository path before accessing it
+        resolved_path = self._normalize_and_validate_path(path, must_be_dir=True)
+
+        self.analyze_local_folder(str(resolved_path), ignore)

         # Save processed commit hash to the DB
-        repo = Repository(path)
+        repo = Repository(str(resolved_path))
         head = repo.commit("HEAD")
         self.graph.set_graph_commit(head.short_id)
```
```diff
 def analyze_sources(self, path: Path, ignore: list[str], graph: Graph) -> None:
     path = path.resolve()
-    files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs"))
+    files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) + list(path.rglob("*.js"))
```
Check failure: Code scanning / CodeQL · Uncontrolled data used in path expression (High)

Copilot Autofix (AI, 9 days ago)
General approach: constrain and validate user-provided paths before using them in filesystem operations. At minimum, decide on a safe root directory under which all analysis must occur, normalize the requested path, and ensure the normalized path is contained within that root. This also gives CodeQL a clear, recognizable mitigation pattern (normalize then prefix-check).

Best fix in this codebase without changing existing functionality more than necessary:

- In `SourceAnalyzer.analyze_local_folder`, convert the string `path` into a normalized `Path` object, reject non-absolute or non-directory paths, and (crucially) enforce that the path lies within a configurable root directory. Use `Path.resolve()` and `.relative_to()` to ensure containment.
- Pass the validated `Path` object into `analyze_sources` instead of constructing a new `Path` from the raw string.
- Centralize the definition of the allowed root directory in `SourceAnalyzer` (e.g., an environment-variable-controlled root or default to the current working directory), so that we do not change external APIs but still restrict analysis to a subtree.
- Keep all other behavior (files discovered via `rglob`, graph creation, etc.) unchanged.

Concretely:

- In `api/analyzers/source_analyzer.py`:
  - Add an attribute (e.g., `self.root_dir`) in `SourceAnalyzer.__init__` to define the root directory from an environment variable like `CODE_GRAPH_ROOT_DIR` or default to the process working directory (`Path.cwd()`), and resolve it.
  - In `analyze_local_folder`:
    - Resolve the incoming `path` to `requested_path = Path(path).resolve()`.
    - Ensure `requested_path` is a directory (`requested_path.is_dir()`).
    - Ensure `requested_path` is inside `self.root_dir` using `requested_path.relative_to(self.root_dir)` in a `try` block; if it raises `ValueError`, log and raise an exception (or just log and return).
    - Call `self.analyze_sources(requested_path, ignore, g)` rather than recreating `Path(path)` inside.
- In `analyze_sources`, keep the existing `path = path.resolve()` and `rglob` usage; now the input has already been constrained to lie under a safe root, satisfying CodeQL's recommendation while preserving the method's behavior for internal callers.

This fix addresses all variants of the alert because every path originating from HTTP (`tests/index.py` or `api/index.py`) flows through `SourceAnalyzer.analyze_local_folder` and then into `analyze_sources`, which will now only operate within the intended root directory.
```diff
@@ -36,6 +36,17 @@

 class SourceAnalyzer():
+    def __init__(self) -> None:
+        # Define a root directory under which all analysis must occur.
+        # This can be configured via the CODE_GRAPH_ROOT_DIR environment variable
+        # and defaults to the current working directory.
+        import os
+        root_dir_env = os.environ.get("CODE_GRAPH_ROOT_DIR")
+        if root_dir_env:
+            self.root_dir = Path(root_dir_env).resolve()
+        else:
+            self.root_dir = Path.cwd().resolve()
+        self.graph: Optional[Graph] = None
     def __init__(self) -> None:
         self.files: dict[Path, File] = {}

     def supported_types(self) -> list[str]:
@@ -195,9 +206,23 @@

         logging.info(f"Analyzing local folder {path}")

-        # Analyze source files
-        self.analyze_sources(Path(path), ignore, g)
+        # Resolve and validate the requested path to ensure it is within the allowed root directory.
+        requested_path = Path(path).resolve()
+
+        if not requested_path.is_dir():
+            logging.error(f"Path '{requested_path}' does not exist or is not a directory")
+            raise ValueError(f"Invalid path: '{requested_path}' must be an existing directory")
+
+        try:
+            # Ensure the requested path is contained within the configured root directory.
+            requested_path.relative_to(self.root_dir)
+        except ValueError:
+            logging.error(f"Path '{requested_path}' is outside of the allowed root directory '{self.root_dir}'")
+            raise ValueError("Invalid path: directory is outside of the allowed root")
+
+        # Analyze source files in the validated directory
+        self.analyze_sources(requested_path, ignore, g)

         logging.info("Done analyzing path")

     def analyze_local_repository(self, path: str, ignore: Optional[list[str]] = None) -> Graph:
```
```diff
 def analyze_sources(self, path: Path, ignore: list[str], graph: Graph) -> None:
     path = path.resolve()
-    files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs"))
+    files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) + list(path.rglob("*.js"))
```
Check failure: Code scanning / CodeQL · Uncontrolled data used in path expression (High)

Copilot Autofix (AI, 9 days ago)
In general, to fix this kind of issue you must not let arbitrary user input select arbitrary filesystem roots. Instead, restrict paths to a safe base directory (or a fixed allow-list of roots) and/or treat the user-provided value only as a name within a controlled directory. This is done by (1) defining a safe root directory (for repositories or local folders), (2) constructing a candidate path by joining the root and the user input, (3) normalizing/resolving that path, and (4) verifying that the resolved path is still within the allowed root. If the check fails, return an error.

For this codebase, the best fix with minimal behavior change is:

- Introduce a single helper in `SourceAnalyzer` that:
  - Accepts the untrusted `path` string and a base directory `Path`.
  - Constructs `resolved = (base_dir / path).resolve()`.
  - Ensures `resolved.is_dir()` and that `resolved` is inside `base_dir` via `resolved.is_relative_to(base_dir)` (Python 3.9+) or a `try: resolved.relative_to(base_dir)` fallback.
- Use this helper in `analyze_local_folder` before calling `analyze_sources`. That way, every caller that passes a string path (including both `tests/index.py` and `api/index.py` flows) will be constrained to a configured base directory such as the current working directory or a specific environment-configurable root.
- Use the returned safe `Path` to call `analyze_sources`, so `path.rglob(...)` in `analyze_sources` always operates under the safe root.

Concretely:

- In `api/analyzers/source_analyzer.py`:
  - Add `import os` (standard library) since we'll read an optional env var for the base root.
  - Add a private method `_resolve_and_validate_path(self, path: str) -> Path` inside `SourceAnalyzer` before `analyze_local_folder`.
    - Read the base root from an environment variable like `CODEGRAPH_BASE_DIR` if present, otherwise default to `Path.cwd()`.
    - Resolve `base_root = base_root.resolve()`.
    - Build `candidate = (base_root / path).resolve()`.
    - Check `candidate.is_dir()` and that it is inside `base_root`. If not, raise `ValueError`.
  - Update `analyze_local_folder` to call this helper:
    - Replace `self.analyze_sources(Path(path), ignore, g)` with `safe_path = self._resolve_and_validate_path(path)` and then `self.analyze_sources(safe_path, ignore, g)`.

This keeps the public API of `SourceAnalyzer` unchanged while ensuring that all filesystem walks start from a safe, controlled base directory and no longer directly trust arbitrary absolute/relative paths from HTTP requests.
```diff
@@ -18,6 +18,7 @@
 from multilspy.multilspy_logger import MultilspyLogger

 import logging
+import os
 # Configure logging
 logging.basicConfig(level=logging.DEBUG, format='%(filename)s - %(asctime)s - %(levelname)s - %(message)s')

@@ -184,6 +185,33 @@
         # Second pass analysis of the source code
         self.second_pass(graph, files, path)

+    def _resolve_and_validate_path(self, path: str) -> Path:
+        """
+        Resolve a user-provided path string against a safe base directory
+        and ensure the result is an existing directory within that base.
+
+        The base directory can be configured via the CODEGRAPH_BASE_DIR
+        environment variable; if unset, the current working directory is used.
+        """
+        base_dir_env = os.environ.get("CODEGRAPH_BASE_DIR")
+        base_dir = Path(base_dir_env) if base_dir_env else Path.cwd()
+        base_dir = base_dir.resolve()
+
+        # Join the user-provided path to the base directory and resolve it
+        candidate = (base_dir / path).resolve()
+
+        # Ensure the resolved path is a directory
+        if not candidate.is_dir():
+            raise ValueError(f"Path '{path}' is not a valid directory within the allowed base directory.")
+
+        # Ensure the candidate path is within the base directory
+        try:
+            candidate.relative_to(base_dir)
+        except ValueError:
+            raise ValueError(f"Path '{path}' escapes the allowed base directory.")
+
+        return candidate
+
     def analyze_local_folder(self, path: str, g: Graph, ignore: Optional[list[str]] = []) -> None:
         """
         Analyze path.
@@ -195,8 +223,11 @@

         logging.info(f"Analyzing local folder {path}")

+        # Resolve and validate the user-provided path against a safe base directory
+        safe_path = self._resolve_and_validate_path(path)
+
         # Analyze source files
-        self.analyze_sources(Path(path), ignore, g)
+        self.analyze_sources(safe_path, ignore, g)

         logging.info("Done analyzing path")
```
```diff
 def analyze_sources(self, path: Path, ignore: list[str], graph: Graph) -> None:
     path = path.resolve()
-    files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs"))
+    files = list(path.rglob("*.java")) + list(path.rglob("*.py")) + list(path.rglob("*.cs")) + list(path.rglob("*.js"))
```
Check failure: Code scanning / CodeQL · Uncontrolled data used in path expression (High)

Copilot Autofix (AI, 9 days ago)
General approach: constrain user-controlled paths to a safe root, and normalize them before use. The analyzer should only traverse directories inside a configured "workspace root" (for example, an environment variable like `CODE_GRAPH_WORKSPACE_ROOT` or the current working directory), and should reject inputs that escape that root. Normalization (via `Path.resolve()` / `os.path.realpath`) must be done before checking containment.

Best concrete fix with minimal behavior change:

- Add a helper in `SourceAnalyzer` that takes an input path string, resolves it to an absolute `Path`, and enforces that it lies under an allowed root directory.
  - Determine the allowed root using an environment variable (e.g., `CODE_GRAPH_WORKSPACE_ROOT`) if present, otherwise default to the current working directory (`Path.cwd()`), which is safe and requires no extra configuration.
  - Use `Path.resolve()` on both the root and the user path.
  - Verify that `resolved_user_path == allowed_root` or `allowed_root in resolved_user_path.parents`. If not, log and raise a `ValueError`.
- Update `analyze_local_folder` to call this helper instead of blindly wrapping `path` with `Path(path)`. Pass the resulting safe `Path` into `analyze_sources`.
- This ensures that any usage of `analyze_local_folder` (from `tests/index.py` or `api/index.py`) inherits the same validation without further changes to those files.

All changes are limited to `api/analyzers/source_analyzer.py`. We'll need:

- `import os` (a well-known standard lib) to read the environment variable.
- A new private method `SourceAnalyzer._resolve_and_validate_path(self, path: str) -> Path`.
- Adjusted code in `analyze_local_folder` to use that method and handle its result.
```diff
@@ -18,6 +18,7 @@
 from multilspy.multilspy_logger import MultilspyLogger

 import logging
+import os
 # Configure logging
 logging.basicConfig(level=logging.DEBUG, format='%(filename)s - %(asctime)s - %(levelname)s - %(message)s')
@@ -184,6 +185,33 @@
         # Second pass analysis of the source code
         self.second_pass(graph, files, path)

+    def _resolve_and_validate_path(self, path: str) -> Path:
+        """
+        Resolve the given path and ensure it is contained within an allowed root.
+
+        The allowed root is determined by the CODE_GRAPH_WORKSPACE_ROOT environment
+        variable if set; otherwise, it defaults to the current working directory.
+        """
+        # Determine allowed root directory
+        workspace_root = os.environ.get("CODE_GRAPH_WORKSPACE_ROOT")
+        if workspace_root:
+            allowed_root = Path(workspace_root).resolve()
+        else:
+            allowed_root = Path.cwd().resolve()
+
+        resolved_path = Path(path).resolve()
+
+        # Ensure the resolved path is within the allowed root
+        if resolved_path != allowed_root and allowed_root not in resolved_path.parents:
+            logging.error(
+                "Rejected path '%s' as it is outside the allowed root '%s'",
+                resolved_path,
+                allowed_root,
+            )
+            raise ValueError(f"Path '{path}' is outside the allowed root.")
+
+        return resolved_path
+
     def analyze_local_folder(self, path: str, g: Graph, ignore: Optional[list[str]] = []) -> None:
         """
         Analyze path.
@@ -195,8 +223,11 @@

         logging.info(f"Analyzing local folder {path}")

+        # Resolve and validate the provided path before analyzing
+        safe_path = self._resolve_and_validate_path(path)
+
         # Analyze source files
-        self.analyze_sources(Path(path), ignore, g)
+        self.analyze_sources(safe_path, ignore, g)

         logging.info("Done analyzing path")
```
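The containment check in the new helper can be exercised in isolation. Below is a minimal sketch of the same rule (the paths used are illustrative); on Python 3.9+, `resolved.is_relative_to(allowed_root)` expresses both conditions at once.

```python
from pathlib import Path

def is_contained(path: str, allowed_root: Path) -> bool:
    # Same rule as _resolve_and_validate_path: accept the root itself
    # or any path that has the root among its ancestors.
    resolved = Path(path).resolve()
    return resolved == allowed_root or allowed_root in resolved.parents

root = Path("/tmp/workspace").resolve()
print(is_contained("/tmp/workspace/src/app.py", root))     # True
print(is_contained("/tmp/workspace/../outside.py", root))  # False
```

Note that `resolve()` normalizes `..` segments and follows symlinks, which is what prevents traversal tricks like the second example from escaping the root.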
Update `tree-sitter-javascript` from `>=0.23.2,<0.24.0` (no such version exists) to `>=0.25.0,<0.26.0` to match the `tree-sitter>=0.25.2` base dependency.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
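Assuming a standard PEP 621 layout (the surrounding entries shown here are illustrative), the resulting constraint in `pyproject.toml` would look roughly like:

```toml
[project]
dependencies = [
    "tree-sitter>=0.25.2",
    "tree-sitter-javascript>=0.25.0,<0.26.0",
]
```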
Actionable comments posted: 1
🧹 Nitpick comments (3)
api/analyzers/source_analyzer.py (1)
148-148: JavaScript second-pass symbol resolution is effectively disabled. With `NullLanguageServer`, `AbstractAnalyzer.resolve()` will throw on `lsp.request_definition(...)` and silently return `[]` (caught by the base class's blanket `except`). This means `resolve_type`/`resolve_method` in `JavaScriptAnalyzer` will never resolve any symbols; no `CALLS`, `EXTENDS`, or `PARAMETERS` edges will be created during the second pass.

This is fine for an initial implementation, but consider adding a TODO or logging a warning so it's clear that JS symbol resolution is a known gap. When ready, integrating a real JS/TS language server (e.g., `typescript-language-server`) would enable full graph connectivity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/analyzers/source_analyzer.py` at line 148, the JavaScript language-server entry currently uses lsps[".js"] = NullLanguageServer(), which causes AbstractAnalyzer.resolve() to throw on lsp.request_definition(...) and prevents JavaScriptAnalyzer.resolve_type/resolve_method from producing CALLS/EXTENDS/PARAMETERS edges; update the code to mark this as a known limitation by adding a TODO and/or logging a clear warning where lsps[".js"] is set (or in NullLanguageServer.__init__), referencing lsps, NullLanguageServer, AbstractAnalyzer.resolve, lsp.request_definition, and JavaScriptAnalyzer.resolve_type/resolve_method so future work can replace NullLanguageServer with a real JS/TS language server (e.g., typescript-language-server) to enable second-pass symbol resolution.

api/analyzers/javascript/analyzer.py (2)
40-45: Docstring extraction only captures an immediately preceding single comment. This won't capture JSDoc blocks that are separated from the declaration by whitespace/newlines (since they wouldn't be the `prev_sibling`), nor will it capture multi-line comment blocks split across multiple `comment` nodes. For typical single-block JSDoc (`/** ... */`), this should work fine. Consider this a known limitation for the initial implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/analyzers/javascript/analyzer.py` around lines 40 - 45, get_entity_docstring currently only returns node.prev_sibling if it's a single 'comment', which misses JSDoc blocks separated by whitespace/newlines and multi-line comment sequences; update get_entity_docstring to walk backwards from the declaration node via prev_sibling, skipping over non-significant whitespace/newline nodes and collecting contiguous 'comment' nodes (e.g., accumulate while prev_sibling.type == 'comment'), decode and join their .text values into a single string, then return that combined docstring for the function_declaration, class_declaration, and method_definition cases (refer to the get_entity_docstring function and the node.prev_sibling traversal).
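The backward traversal the prompt describes can be sketched independently of tree-sitter. The `Node` class below is a minimal stand-in mimicking tree-sitter's `type`, `text`, and `prev_sibling` attributes (names and structure are illustrative, not the project's actual code); note that real tree-sitter trees do not represent whitespace as nodes, so a plain sibling walk usually suffices.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Minimal stand-in for a tree-sitter Node (illustrative only).
    type: str
    text: bytes
    prev_sibling: Optional["Node"] = None

def collect_preceding_comments(node: Node) -> Optional[str]:
    # Walk backwards through siblings, gathering contiguous comment
    # nodes, then join them back in source order.
    comments = []
    sib = node.prev_sibling
    while sib is not None and sib.type == 'comment':
        comments.append(sib.text.decode('utf-8'))
        sib = sib.prev_sibling
    return '\n'.join(reversed(comments)) if comments else None

jsdoc = Node('comment', b'/** Adds two numbers. */')
fn = Node('function_declaration', b'function add(a, b) {}', prev_sibling=jsdoc)
print(collect_preceding_comments(fn))  # /** Adds two numbers. */
```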
68-72: Parameter query only captures simple identifiers. The tree-sitter query `(formal_parameters (identifier) @parameter)` will miss destructured parameters (`{a, b}`), default values (`a = 1`), and rest parameters (`...args`), which are common in modern JavaScript. This is acceptable for an initial implementation but worth noting as a known limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/analyzers/javascript/analyzer.py` around lines 68 - 72, The current Tree-sitter query only matches simple identifier parameters; update the query passed to self.language.query in analyzer.py to also capture destructured, default, and rest forms by including nodes like formal_parameters (identifier | object_pattern | array_pattern | assignment_pattern | rest_pattern) `@parameter` (or the equivalent node names used by the JavaScript grammar), then adjust the captures handling (captures = query.captures(entity.node)) to iterate over captures['parameter'] as before and call entity.add_symbol("parameters", parameter) so destructured ({a,b}), default (a = 1), and rest (...args) parameters are included along with simple identifiers.
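The broadened capture the prompt describes could use tree-sitter's bracket alternation syntax. The exact node names (`object_pattern`, `assignment_pattern`, `rest_pattern`, etc.) are assumptions about the tree-sitter-javascript grammar and should be verified against it:

```scheme
(formal_parameters
  [
    (identifier)
    (object_pattern)
    (array_pattern)
    (assignment_pattern)
    (rest_pattern)
  ] @parameter)
```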
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/analyzers/javascript/analyzer.py`:
- Around line 50-61: In add_symbols (method add_symbols on Entity handling
class_declaration), remove the unused assignment to superclass_node and
eliminate or simplify the misleading heritage variable: either rename heritage
to body_node and use it for a clear defensive check or drop the body existence
guard entirely since class bodies are required; keep the loop that detects
child.type == 'class_heritage' and call entity.add_symbol("base_class",
heritage_child) for heritage_child.type == 'identifier' unchanged.
---
Duplicate comments:
In `@api/analyzers/javascript/analyzer.py`:
- Around line 1-13: The import statements in analyzer.py are already explicit
and clean, so no change is required; keep the explicit imports
(SyncLanguageServer, Entity, File, AbstractAnalyzer, tree_sitter_javascript as
tsjs, Language, Node, logging) as-is and ensure the logger variable logger =
logging.getLogger('code_graph') remains correctly named and used consistently
throughout this module.
---
Nitpick comments:
In `@api/analyzers/javascript/analyzer.py`:
- Around line 40-45: get_entity_docstring currently only returns
node.prev_sibling if it's a single 'comment', which misses JSDoc blocks
separated by whitespace/newlines and multi-line comment sequences; update
get_entity_docstring to walk backwards from the declaration node via
prev_sibling, skipping over non-significant whitespace/newline nodes and
collecting contiguous 'comment' nodes (e.g., accumulate while prev_sibling.type
== 'comment'), decode and join their .text values into a single string, then
return that combined docstring for the function_declaration, class_declaration,
and method_definition cases (refer to the get_entity_docstring function and the
node.prev_sibling traversal).
- Around line 68-72: The current Tree-sitter query only matches simple
identifier parameters; update the query passed to self.language.query in
analyzer.py to also capture destructured, default, and rest forms by including
nodes like formal_parameters (identifier | object_pattern | array_pattern |
assignment_pattern | rest_pattern) `@parameter` (or the equivalent node names used
by the JavaScript grammar), then adjust the captures handling (captures =
query.captures(entity.node)) to iterate over captures['parameter'] as before and
call entity.add_symbol("parameters", parameter) so destructured ({a,b}), default
(a = 1), and rest (...args) parameters are included along with simple
identifiers.
In `@api/analyzers/source_analyzer.py`:
- Line 148: The JavaScript language-server entry currently uses lsps[".js"] =
NullLanguageServer(), which causes AbstractAnalyzer.resolve() to throw on
lsp.request_definition(...) and prevents
JavaScriptAnalyzer.resolve_type/resolve_method from producing
CALLS/EXTENDS/PARAMETERS edges; update the code to mark this as a known
limitation by adding a TODO and/or logging a clear warning where lsps[".js"] is
set (or in NullLanguageServer.__init__), referencing lsps, NullLanguageServer,
AbstractAnalyzer.resolve, lsp.request_definition, and
JavaScriptAnalyzer.resolve_type/resolve_method so future work can replace
NullLanguageServer with a real JS/TS language server (e.g.,
typescript-language-server) to enable second-pass symbol resolution.
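The TODO/warning asked for in the `NullLanguageServer` prompt above could look like the sketch below. This is a hypothetical illustration (the real `NullLanguageServer` in the codebase may be shaped differently); it shows one way to make the disabled JS resolution loud instead of silent.

```python
import logging

logger = logging.getLogger('code_graph')

class NullLanguageServer:
    """Stub LSP client: JS second-pass symbol resolution is a known gap."""

    def __init__(self) -> None:
        # TODO: replace with a real JS/TS language server
        # (e.g. typescript-language-server) to enable CALLS/EXTENDS edges.
        logger.warning(
            "JavaScript symbol resolution disabled: no language server configured"
        )

    def request_definition(self, *args, **kwargs):
        # The caller catches this and falls back to an empty result.
        raise NotImplementedError("NullLanguageServer cannot resolve symbols")
```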
```python
def add_symbols(self, entity: Entity) -> None:
    if entity.node.type == 'class_declaration':
        heritage = entity.node.child_by_field_name('body')
        if heritage is None:
            return
        superclass_node = entity.node.child_by_field_name('name')
        # Check for `extends` clause via class_heritage
        for child in entity.node.children:
            if child.type == 'class_heritage':
                for heritage_child in child.children:
                    if heritage_child.type == 'identifier':
                        entity.add_symbol("base_class", heritage_child)
```
Dead code and misleading variable names in add_symbols for class declarations.
Two issues:
- Line 55: `superclass_node` is assigned but never used (confirmed by Ruff F841). Remove it.
- Line 52: Variable named `heritage` but it fetches the `'body'` field, which is misleading. It's also only used for the `None` check on line 53, a defensive guard that's unlikely to trigger (classes without bodies are syntactically invalid JS).
🔧 Proposed fix
```diff
 def add_symbols(self, entity: Entity) -> None:
     if entity.node.type == 'class_declaration':
-        heritage = entity.node.child_by_field_name('body')
-        if heritage is None:
+        body = entity.node.child_by_field_name('body')
+        if body is None:
             return
-        superclass_node = entity.node.child_by_field_name('name')
         # Check for `extends` clause via class_heritage
         for child in entity.node.children:
```
🪛 Ruff (0.15.1)
[error] 55-55: Local variable superclass_node is assigned to but never used
Remove assignment to unused variable superclass_node
(F841)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/analyzers/javascript/analyzer.py` around lines 50 - 61, In add_symbols
(method add_symbols on Entity handling class_declaration), remove the unused
assignment to superclass_node and eliminate or simplify the misleading heritage
variable: either rename heritage to body_node and use it for a clear defensive
check or drop the body existence guard entirely since class bodies are required;
keep the loop that detects child.type == 'class_heritage' and call
entity.add_symbol("base_class", heritage_child) for heritage_child.type ==
'identifier' unchanged.
Fixes #51
Add support for JavaScript code analysis using tree-sitter.
- Add `api/analyzers/javascript/analyzer.py` implementing a `JavaScriptAnalyzer` class using tree-sitter for JavaScript.
- Update `api/analyzers/source_analyzer.py` to include `JavaScriptAnalyzer` in the analyzers list.
- Add the `tree-sitter-javascript` dependency to `pyproject.toml`.
- Add utility functions for JavaScript analysis in `api/analyzers/utils.py`.

For more details, open the Copilot Workspace session.
Summary by CodeRabbit